WEEK 2: TIDY DATA + BASICS OF GRAPHICS

Tuesday, January 17th

Today we will…

Tidy Data

Artwork by Allison Horst

Same Data, Different Formats

Which data follows a tidy data format?

Team  Points  Assists  Rebounds
A     88      12       22
B     91      17       28
C     99      24       30
D     94      28       31

Team  Variable  Value
A     Points    88
A     Assists   12
A     Rebounds  22
B     Points    91
B     Assists   17
B     Rebounds  28
C     Points    99
C     Assists   24
C     Rebounds  30
D     Points    94
D     Assists   28
D     Rebounds  31
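
The two layouts above hold the same values, and it can help to see how one is converted into the other. The following is a minimal sketch, assuming the tidyr and tibble packages are installed; the object names wide, long, and wide_again are made up for illustration.

```r
library(tidyr)
library(tibble)

# Wide layout: one row per team, one column per statistic
wide <- tribble(
  ~Team, ~Points, ~Assists, ~Rebounds,
  "A",   88,      12,       22,
  "B",   91,      17,       28,
  "C",   99,      24,       30,
  "D",   94,      28,       31
)

# Long layout: one row per team-statistic pair
long <- pivot_longer(wide,
                     cols = c(Points, Assists, Rebounds),
                     names_to = "Variable",
                     values_to = "Value")

# And back to the wide layout again
wide_again <- pivot_wider(long,
                          names_from = Variable,
                          values_from = Value)
```

Printing wide and long side by side is a quick way to check which layout a particular analysis or plot needs.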

Working with External Data

Common Types of Data Files

.csv : “Comma-separated”

Name, Age
Bob, 49
Joe, 40


.xls, .xlsx: Microsoft Excel Spreadsheet

  • Common approach: save as .csv
  • Nicer approach: readxl package

.txt: Plain text

  • Could be just text
  • Could be comma-separated data
  • Could be tab-separated, bar-separated, etc.
  • Need to let R know what to look for

Loading External Data

The tidyverse has cleaned-up versions of base R's data-import functions in the readr and readxl packages:

  • read_csv() works like read.csv(), but returns a tibble and reports the column types it guessed

  • read_tsv() is for tab-separated data

  • read_table() is for any data with “columns” (white space separating)

  • read_delim() is for special “delimiters” separating data

  • read_excel() is specifically for dealing with Excel files
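
As a rough illustration of the functions listed above, the sketch below writes two tiny files to a temporary location and reads them back. It assumes the readr package is installed; the file contents and the names csv_path, bar_path, people_csv, and people_bar are invented for this example.

```r
library(readr)

# Write two tiny example files to temporary paths (illustrative data only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Age", "Bob,49", "Joe,40"), csv_path)

bar_path <- tempfile(fileext = ".txt")
writeLines(c("Name|Age", "Bob|49", "Joe|40"), bar_path)

# Comma-separated file
people_csv <- read_csv(csv_path)

# Bar-separated text file: tell R which delimiter to look for
people_bar <- read_delim(bar_path, delim = "|")
```

Both calls return the same two-row tibble; only the delimiter differs.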

Grammar of Graphics

Grammar of Graphics: graphic forms from the ground up

Think of a data visualization or graph as a mapping

  • from variables in the data set, (or statistics computed from the data)
  • to visual attributes (or “aesthetics”) of marks (or “geometric elements”) on the page/screen

Grammar of Graphics: why bother?

It’s not just a neat party trick!

  • More flexible than “chart zoo” of named graphs
  • Software understands the structure of your graph
    • easily automate small multiples for data subsets

Note

“[The grammar] makes it easier for you to iteratively update a plot, changing a single feature at a time. The grammar is also useful because it suggests the high-level aspects of a plot that can be changed, giving you a framework to think about graphics, and hopefully shortening the distance from mind to paper. It also encourages the use of graphics customised to a particular problem, rather than relying on generic named graphics.”

Grammar of Graphics: components

GoG components, as specified in R’s ggplot2

  • data
  • aes : aesthetic mappings (position, length, color, symbol, …)
  • geom : geometric element (point, line, bar, …)
  • stat : statistical variable transformation (identity, count, linear model, quantile, …)
  • scale : scale transformation (log scale, color mapping, axes tick breaks, …)
  • coord : Cartesian, polar, map projection, …
  • facet : divide into subplots / small multiples using a categorical variable

Of course, we can also control axes, legends, titles … (guides)
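
To make the component list concrete, here is one possible plot that touches each component once. It is only a sketch, assuming ggplot2 is installed and using its built-in mpg data; the particular variables, limits, and title are arbitrary choices for illustration.

```r
library(ggplot2)

p <- ggplot(data = mpg,                              # data
            mapping = aes(x = displ, y = hwy,        # aes
                          color = drv)) +
  geom_point() +                                     # geom
  stat_smooth(method = "lm", se = FALSE) +           # stat
  scale_x_log10() +                                  # scale
  coord_cartesian(ylim = c(10, 45)) +                # coord
  facet_wrap(~ year) +                               # facet
  labs(title = "Highway mileage by engine size")     # labels

p
```

Removing or swapping any single line changes one component of the plot and leaves the others alone.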

Using ggplot2

How to build a graph

This begins a plot that you can finish by adding layers.

ggplot(data = mpg)

ggplot(data = mpg, 
       aes(x = class, y = hwy)
       )

ggplot(data = mpg, 
       aes(x = class, y = hwy)
       ) +
  geom_jitter()

ggplot(data = mpg, 
       aes(x = class, y = hwy)
       ) +
  geom_jitter() +
  geom_boxplot()

How would you change the code to have the points on top of the boxplots?

Aesthetics

In ggplot2, we map variables from the data set to aesthetics on the chart

ggplot(data = txhousing, aes(x = date, y = median, color = city)) + 
  geom_point() + 
  geom_smooth(method = "loess") + 
  xlab("Date") + ylab("Median Home Price") + 
  ggtitle("Texas Housing Prices")

Not an exhaustive list – see ggplot2 cheat sheet

  • x, y
  • color and fill
  • linetype
  • lineend
  • linejoin
  • size
  • shape

Global Aesthetics

ggplot(data = housingsub, 
       mapping = aes(x = date, 
                     y = median)
       ) +
  geom_point()

Local Aesthetics

ggplot(data = housingsub) +
  geom_point(mapping = aes(x = date, 
                           y = median)
             )

Mapping Aesthetics

ggplot(data = housingsub) +
  geom_point(mapping = aes(x = date, 
                           y = median,
                           color = city)
             )

Setting Aesthetics

ggplot(data = housingsub) +
  geom_point(mapping = aes(x = date, 
                           y = median, 
                           color = city), 
             color = "blue"
               )
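
The difference between mapping and setting an aesthetic can be sketched with the built-in mpg data (used here because housingsub is not defined in these notes). The object names p_map, p_set, and p_slip are made up for this example.

```r
library(ggplot2)

# Mapping: color varies with a variable, so it goes inside aes()
p_map <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy, color = class))

# Setting: every point gets the same fixed color, so it goes outside aes()
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy), color = "blue")

# Common slip: a constant inside aes() is treated as a one-level variable,
# so ggplot2 picks a default color and adds a legend labelled "blue"
p_slip <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy, color = "blue"))
```

When a layer both maps and sets the same aesthetic, as in the Setting Aesthetics example, the fixed value is the one that gets drawn.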

Geometric objects

In ggplot2, we use a geom function to represent data points, and use the geom’s aesthetic properties to represent variables.

ggplot(data = mpg,
       aes(x = cty,
           y = hwy,
           color = class)
       ) +
  geom_point() +
  labs(x = "City (mpg)", y = "Highway (mpg)")

ggplot(data = mpg,
       aes(x = cty,
           y = hwy,
           color = class)
       ) +
  geom_text(aes(label = class)) +
  labs(x = "City (mpg)", y = "Highway (mpg)")

Not an exhaustive list – see ggplot2 cheat sheet

one variable

  • geom_density()
  • geom_dotplot()
  • geom_histogram()
  • geom_boxplot()

two variable

  • geom_point()
  • geom_line()
  • geom_density_2d()

three variable

  • geom_contour()
  • geom_raster()

Once our data is formatted and we know what type of variables we are working with, we can select the correct geom for our visualization.

Alternative method of building layers: Stats

A stat builds a new variable to plot (e.g., count and proportion)
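
A small sketch of what "building a new variable" means, assuming ggplot2 and its built-in mpg data. The plot names are invented for illustration.

```r
library(ggplot2)

# geom_bar() uses stat = "count" by default: it computes a new variable,
# count, from the rows of mpg
p_count <- ggplot(data = mpg, mapping = aes(x = class)) +
  geom_bar()

# after_stat() refers to a variable computed by the stat;
# group = 1 makes the proportions relative to all rows
p_prop <- ggplot(data = mpg,
                 mapping = aes(x = class, y = after_stat(prop), group = 1)) +
  geom_bar()

# The same bar layer, built from the stat side rather than the geom side
p_stat <- ggplot(data = mpg, mapping = aes(x = class)) +
  stat_count(geom = "bar")
```

Calling layer_data(p_count) shows the computed count column that the bars are drawn from.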

Faceting

A way to extract subsets of data and place them side-by-side in graphics

Note

sometimes called small multiples

ggplot(data = mpg, aes(x = cty, y = hwy, color = class)) + 
  geom_point() +
  facet_grid(~class)

  • facet_grid(. ~ b): facet into columns based on b
  • facet_grid(a ~ .): facet into rows based on a
  • facet_grid(a ~ b): facet into both rows and columns
  • facet_wrap( ~ fl): wrap facets into a rectangular layout

You can set scales to let axis limits vary across facets:

  • facet_grid(y ~ x, scales = "free"): x and y axis limits adjust to individual facets
    • "free_x" - x axis limits adjust
    • "free_y" - y axis limits adjust

You can also set a labeller to adjust facet labels:

  • facet_grid(. ~ fl, labeller = label_both)
  • facet_grid(. ~ fl, labeller = label_bquote(cols = alpha ^ .(fl)))
  • facet_grid(. ~ fl, labeller = label_parsed)
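
The faceting options above can be combined. The following is a sketch using the built-in mpg data, with drv, year, and class chosen arbitrarily as faceting variables.

```r
library(ggplot2)

p_base <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point()

# Rows by drv, columns by year; axis limits adjust per row and column,
# and label_both prints "variable: value" in each strip
p_grid <- p_base +
  facet_grid(drv ~ year, scales = "free", labeller = label_both)

# Wrap the levels of a single variable into a rectangular layout
p_wrap <- p_base +
  facet_wrap(~ class, ncol = 4)
```

facet_grid() suits two crossed variables, while facet_wrap() is usually the better fit for one variable with many levels.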

Position Adjustments

Position adjustments determine how to arrange geoms that would otherwise occupy the same space

  • position = 'dodge': Arrange elements side by side
  • position = 'fill': Stack elements on top of one another, normalize height
  • position = 'stack': Stack elements on top of one another
  • position = 'jitter': Add random noise to X & Y position of each element to avoid overplotting (see geom_jitter())

ggplot(mpg, aes(fl, fill = drv)) + 
  geom_bar(position = "")
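
One way to compare the bar-related position adjustments is to build the same plot three times and change only the position argument. This is a sketch with invented object names, assuming ggplot2 and its mpg data.

```r
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = fl, fill = drv))

p_stack <- base + geom_bar(position = "stack")  # the default for geom_bar()
p_dodge <- base + geom_bar(position = "dodge")  # bars side by side
p_fill  <- base + geom_bar(position = "fill")   # stacked, heights scaled to 1
```

With "fill", the y axis shows proportions within each fuel type rather than counts.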

Plot Customizations

Clearer labels with labs()

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x = "Engine Displacement (liters)", 
       y = "Highway MPG", 
       color = "Number of \nCylinders",
       title = "Car Efficiency")

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x = "Engine Displacement (liters)", 
       y = "Highway MPG", 
       color = "Number of \nCylinders",
       title = "Car Efficiency") +
  theme_bw() +
  theme(legend.position = "bottom")

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x = "Engine Displacement (liters)",
       color = "Number of \nCylinders",
       title = "Car Efficiency") +
  scale_y_continuous("Highway MPG", 
                     limits = c(0,50),
                     breaks = seq(0,50,5)
                     )

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x = "Engine Displacement (liters)",
       y = "Highway MPG",
       color = "Number of \nCylinders",
       title = "Car Efficiency") +
  scale_color_gradient(low = "white", high = "green4")

Formatting your Plot Code

Tip

Notice how there is a lot of nesting that happens within ggplot2 code (e.g., parentheses within parentheses). It is good practice to put each geom and aesthetic on a new line. This makes code easier to read!

The general guideline is that each line of your code should not be over 80 characters long.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) + geom_point() + theme_bw() + labs(x = "City (mpg)", y = "Highway (mpg)")

ggplot(data = mpg, 
       mapping = aes(x = cty, 
                     y = hwy, 
                     color = class)
       ) + 
  geom_point() + 
  theme_bw() + 
  labs(x = "City (mpg)", 
       y = "Highway (mpg)"
       )

ggplot(data = mpg, 
       mapping = aes(x = cty, y = hwy, color = class)
       ) + 
  geom_point() + 
  theme_bw() + 
  labs(x = "City (mpg)", y = "Highway (mpg)")

PA 2: Using Data Visualization to Find the Penguins

Artwork by Allison Horst

Tip

I encourage you to use your neighbors for support!

To do…

  • PA 2: Using Data Visualization to Find the Penguins
    • Due TOMORROW, Wednesday (1/18) at 8:00AM
  • Bonus Challenge: Ugly Graphics of Penguins (+2)
    • Due TOMORROW, Wednesday (1/18) at 10:10AM

Note

I have office hours TODAY, Tuesday (1/17) from 2:40pm - 3pm in 25-103

Wednesday, January 18th

Today we will…

  • Review PA 2: Using Data Visualization to Find the Penguins
  • Ugly Graphics of Penguins
  • Mini lecture on text material
    • What makes a good graphic?
  • Lab 2: Exploring Rodents with ggplot2
  • Challenge 2: Spicing things up with ggplot2

Why are some plots easier to read than others?

What makes bad figures bad?

Edward R. Tufte is one of the better-known critics of this style of visualization:

  • Graphical excellence is the well-designed presentation of interesting data. It:
    • communicates complex ideas with clarity, precision, and efficiency
    • maximizes the “data-ink” ratio
    • is nearly always multivariate
    • requires telling the truth about the data
  • Tufte defines “chartjunk” as superfluous details

bad data.

Looking at pictures of data means looking at lines, shapes, and colors

Our visual system works in a way that makes some things easier for us to see than others

  • “Preattentive” features
  • Gestalt Principles
  • color and contrast

Good Graphics

Graphics consist of:

  • Structure: boxplot, scatterplot, etc.

  • Aesthetics: features such as color, shape, and size that map other characteristics to structural features

Both the structure and aesthetics should help viewers interpret the information.

Gestalt Principles

What sorts of relationships are inferred, and under what circumstances?

  • Proximity: Things that are spatially near to one another are related.
  • Similarity: Things that look alike are related.
  • Enclosure: A group of related elements is surrounded by a visual element.
  • Symmetry: If an object is asymmetrical, the viewer will waste time trying to find the problem instead of concentrating on the instruction.
  • Closure: Incomplete shapes are perceived as complete.
  • Continuity: Partially hidden objects are completed into familiar shapes.
  • Connection: Things that are visually tied to one another are related.
  • Figure/Ground: Visual elements are either in the foreground or the background.

Gestalt Principles

Gestalt Hierarchy   Graphs
Enclosure           Facets
Connection          Lines
Proximity           White Space
Similarity          Color/Shape

Implications for practice

  • Know how we perceive groups
  • Know that we perceive some groups before others
  • Design to facilitate and emphasize the most important comparisons

Pre-attentive Features

Pre-Attentive Features are things that “jump out” in less than 250 ms

  • Color, form, movement, spatial localization

There is a hierarchy of features:

  • Color is stronger than shape
  • Combinations of pre-attentive features are usually not pre-attentive due to interference

Pre-attentive Features: Double Encoding

Color

  • Hue: shade of color (red, orange, yellow…)

  • Intensity: amount of color

  • Both hue and intensity are pre-attentive. Bigger contrast corresponds to faster detection.

  • Use color to your advantage

  • When choosing color schemes, we will want mappings from data to color that are not just numerically but also perceptually uniform

  • Distinguish between sequential scales and categorical scales

Color: Implications and Guidelines

  • Do not use rainbow color gradient schemes.
  • Avoid any scheme that uses green-yellow-red signaling if you have a target audience that may include colorblind people.
  • To “colorblind-proof” a graphic, you can use a couple of strategies:
    • double encoding - where you use color, use another aesthetic (line type, shape)
    • If you can print your chart out in black and white and still read it, it will be safe for colorblind users. This is the only foolproof way to do it!
    • If you are using a color gradient, use a monochromatic color scheme where possible.
    • If you have a bidirectional scale (e.g. showing positive and negative values), the safest scheme to use is purple - white - orange. In any color scale that is multi-hue, it is important to transition through white, instead of from one color to another directly.
  • Be conscious of what certain colors “mean”
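
Double encoding from the guidelines above might look like the following in ggplot2. This is a sketch only: the choice of drv as the grouping variable and of the viridis discrete scale are assumptions made for illustration, not recommendations from the slides.

```r
library(ggplot2)

p <- ggplot(data = mpg,
            mapping = aes(x = displ, y = hwy,
                          color = drv,
                          shape = drv)) +     # double encoding: color + shape
  geom_point(size = 2) +
  scale_color_viridis_d() +                   # viridis palette shipped with ggplot2
  theme_bw()

p
```

Because drv is mapped to both color and shape, the groups stay distinguishable when the plot is printed in black and white.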

Gradients

No more than 7 colors

Can use colorRampPalette() (from base R's grDevices package) to produce larger palettes by interpolating existing ones, such as those from the RColorBrewer package

Use color gradient with only one hue for positive values

Use color gradient with two hues for positive and negative values. Gradient should go through a light, neutral color (white)
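
The gradient guidelines above can be sketched as follows, assuming the ggplot2 and RColorBrewer packages are installed. The centered variable hwy - mean(hwy) is a made-up example of a quantity with positive and negative values, and the specific colors are illustrative choices.

```r
library(ggplot2)
library(RColorBrewer)

# Interpolate the 9-color "Blues" palette up to 15 colors
blues_15 <- colorRampPalette(brewer.pal(9, "Blues"))(15)

# One hue for a positive-only variable (cty)
p_seq <- ggplot(mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point() +
  scale_color_gradient(low = "white", high = "darkblue")

# Two hues passing through white for a signed variable
p_div <- ggplot(mpg, aes(x = displ, y = cty, color = hwy - mean(hwy))) +
  geom_point() +
  scale_color_gradient2(low = "purple", mid = "white", high = "orange",
                        midpoint = 0)
```

blues_15 is a plain character vector of hex colors, so it can be passed to scale_color_gradientn(colors = blues_15) or to scale_fill_manual().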

Color in ggplot2

There are packages available for use that have color scheme options.

Some Examples:

  • RColorBrewer
  • ggsci
  • viridis
  • wesanderson

There are packages such as RColorBrewer and dichromat that have color palettes which are aesthetically pleasing, and, in many cases, colorblind friendly.

You can also take a look at other ways to find nice color palettes.

Week 2 Assignments

Lab 2: Exploring Rodents with ggplot2

Challenge 2: Spicing things up with ggplot2

To do…

  • Lab 2: Exploring Rodents with ggplot2
    • due Friday, 1/20 at 11:59pm
  • Challenge 2: Spicing things up with ggplot2
    • due Saturday, 1/21 at 11:59pm
  • Read Chapter 3: Data Cleaning and Manipulation
    • Concept Check 3.1 + 3.2 due Monday (1/23) at 8am